An Improvement to Levenshtein's Upper Bound on 
the Cardinality of Deletion Correcting Codes 
Abstract: We consider deletion correcting codes over a q-ary 
alphabet. It is well known that any code capable of correcting s 
deletions can also correct any combination of s total insertions 
and deletions. To obtain asymptotic upper bounds on code size, 
we apply a packing argument to channels that perform different 
mixtures of insertions and deletions. Even though the set of codes 
is identical for all of these channels, the bounds that we obtain 
vary. Prior to this work, only the bounds corresponding to the all 
insertion case and the all deletion case were known. We recover 
these as special cases. The bound from the all deletion case, due 
to Levenshtein, has been the best known for more than forty five 
years. Our generalized bound is better than Levenshtein's bound 
whenever the number of deletions to be corrected is larger than 
the alphabet size. 

I. Introduction 

Deletion channels output only a subsequence of their input 
while preserving the order of the transmitted symbols. Deletion 
channels are related to synchronization problems, a wide 
variety of problems in bioinformatics, and the communication 
of information over packet networks. This paper concerns 
channels that take a fixed length input string of symbols drawn 
from a q-ary alphabet and delete a fixed number of symbols. In 
particular, we are interested in upper bounds on the cardinality 
of the largest possible s-deletion correcting codebook. 

The first such upper bound is due to Levenshtein. He derived 
asymptotic upper and lower bounds on the sizes of binary 
codes for any number of deletions. These bounds easily 
generalize to the q-ary case. He showed that the 
Varshamov-Tenengolts (VT) codes, which had been designed to 
correct a single asymmetric error, could be used to correct 
a single deletion. The VT codes meet the upper bound and 
establish its tightness in the case of a binary alphabet and a 
single deletion. 

Since then, a wide variety of code constructions, which 
provide lower bounds, have been proposed for the deletion 
channel and other closely related channels. One recent 
construction uses constant Hamming weight deletion correcting 
codes [2]. In contrast, progress on upper bounds has been 
rare. Levenshtein eventually refined his original asymptotic 
bound (and the parallel nonbinary bound of Tenengolts) into 
a nonasymptotic version [7]. Kulkarni and Kiyavash recently 
proved a better upper bound for an arbitrary number of 
deletions and any alphabet size. 

Another line of work has attacked some related combinatorial 
problems. These include characterization of the sets of 
superstrings and substrings of any string. Levenshtein showed 
that the number of superstrings does not depend on the starting 
string [6]. He also gave upper and lower bounds on the number 
of substrings using the number of runs in the starting string. 
Calabi and Hartnett gave a tight bound on the number of 
substrings of each length [1]. Hirschberg extended the bound 
to larger alphabets [3]. Swart and Ferreira gave a formula for 
the number of distinct substrings produced by two deletions 
for any starting string. Liron and Langberg improved and 
unified existing bounds and constructed tightness examples. 
Some of our intermediate results contribute to this area. 

A. Upper bound technique 

To derive our upper bounds, we use a simple and general 
packing argument that can be applied to any combinatorial 
channel. Any combinatorial channel can be represented by 
a bipartite graph. Channel inputs correspond to left vertices, 
channel outputs correspond to right vertices, and each edge 
connects an input to an output that can be produced from it. 
If two channel inputs share a common output, they cannot 
both appear in the same code. The degree of an input vertex in 
the graph is the number of possible channel outputs for that 
input. If the degree of each input is at least r and there are N 
possible outputs, any code contains at most N/r codewords. 
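
As a concrete instance of the N/r argument, the following minimal sketch evaluates the bound for a channel that makes at most s substitutions, where every input has the same degree. The names `packing_bound` and `substitution_degree` are illustrative assumptions and do not appear in the paper.

```python
from math import comb

def packing_bound(num_outputs: int, min_degree: int) -> int:
    """Upper bound N/r on code size: N outputs, every input has degree >= r."""
    return num_outputs // min_degree

def substitution_degree(n: int, s: int, q: int) -> int:
    """Number of q-ary strings within Hamming distance s of a length-n string."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(s + 1))

# Binary strings of length 7 with at most one substitution.
n, s, q = 7, 1, 2
bound = packing_bound(q ** n, substitution_degree(n, s, q))
print(bound)
```

For these parameters the degree is 8 and there are 128 outputs, so the sketch reports a bound of 16 codewords.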

For a channel that makes at most s substitution errors, this 
argument leads to the well known Hamming bound. Any code 
capable of correcting s deletions is also capable of correcting 
any combination of s total insertions and deletions (see 
Lemma 2). Despite this equivalence, this packing argument 
produces different upper bounds for channels that perform 
different mixtures of insertions and deletions. Prior to this 
work, the bounds coming from the s-insertion channel and 
the s-deletion channel were known. 

For the s-insertion channel, each q-ary n-symbol input has 
the same degree. For fixed q and s, the degree is asymptotic 
to $\binom{n}{s}(q-1)^s$. There are $q^{n+s}$ possible outputs, so 
an s-insertion correcting code contains asymptotically at most 
$\frac{q^{n+s}}{\binom{n}{s}(q-1)^s}$ codewords. 

The s-deletion case is slightly more complicated because 
different inputs have different degrees. The input strings 
consisting of a single symbol repeated n times have only a 
single possible output: the string with that symbol repeated 
$n-s$ times. Consequently, using the minimum degree over all 
of the inputs yields a worthless bound. The following argument 
is due to Levenshtein [5]. The average degree of an input is 
asymptotic to $\left(\frac{q-1}{q}\right)^s \binom{n}{s}$ and most inputs have a degree close 
to that. The inputs can be divided into two classes: those with 
degree at least $1-\epsilon$ times the average degree and those with 
smaller degree. For an appropriately chosen $\epsilon$ that goes to 
zero as n goes to infinity, the vast majority of inputs fall 
into the former class. Call members of the former class the 
typical inputs. The minimum degree argument can be applied 
to bound the number of typical inputs that can appear in a 
code. There are $q^{n-s}$ possible outputs so asymptotically at 
most $\frac{q^n}{\binom{n}{s}(q-1)^s}$ typical inputs are in a code. We have no 
information about how many of the atypical inputs appear in 
a code, but the total number of atypical inputs is small enough 
to not affect the asymptotics of the upper bound. 

The two bounds have the same growth rates, but the bound 
on deletion correcting codes is a factor of $q^s$ better than the 
bound on insertion correcting codes, despite the fact that any 
s-deletion correcting code is an s-insertion correcting code and 
vice versa. Note that there is no possible improvement to the 
insertion channel bound from dividing the inputs into typical 
and atypical classes. 
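
The gap between the two bounds can be checked numerically. The sketch below evaluates the two asymptotic expressions, $q^{n+s}/(\binom{n}{s}(q-1)^s)$ for insertions and $q^n/(\binom{n}{s}(q-1)^s)$ for deletions, with exact rational arithmetic. The function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def insertion_bound(n: int, s: int, q: int) -> Fraction:
    """Asymptotic packing bound obtained from the s-insertion channel."""
    return Fraction(q ** (n + s), comb(n, s) * (q - 1) ** s)

def deletion_bound(n: int, s: int, q: int) -> Fraction:
    """Asymptotic packing bound obtained from the s-deletion channel."""
    return Fraction(q ** n, comb(n, s) * (q - 1) ** s)

n, s, q = 100, 2, 4
gap = insertion_bound(n, s, q) / deletion_bound(n, s, q)
print(gap)  # the ratio of the two expressions
```

The ratio does not depend on n, which is why the two bounds have the same growth rate.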

We extend this bounding strategy to channels that perform 
both deletions and insertions. We obtain a generalized upper 
bound that includes Levenshtein's bound as a special case. 
Recall that Levenshtein's bound was previously known to be 
tight for one deletion and alphabet size two. The new bound 
improves upon the previously known bounds whenever the 
number of deletions is greater than the alphabet size. 

The rest of the paper is organized as follows. In Section II 
we present some notation and basic results on deletion and 
insertion channels. In Section III we construct a class of 
well-behaved edges in the channel graph. Together with an upper 
bound on the number of edges in the channel graph, the size of this 
class establishes the asymptotics of the average input degree. 
In Section IV we prove a lower bound on the degree of each 
input vertex and use it to establish our main result: an upper 
bound on the size of an s-deletion correcting code. 

II. Preliminaries 

A. Notation 

Let $\mathbb{N}$ be the set of nonnegative integers. Let $[n]$ be the set 
of nonnegative integers less than $n$, $\{0, 1, \ldots, n-1\}$. Let $2^{[n]}$ be 
the family of subsets of $[n]$. Let $\binom{[n]}{k}$ be the family of $k$ 
element subsets of $[n]$. Let $[q]^n$ be the set of q-ary strings of 
length $n$. Let $[q]^*$ be the set of q-ary strings of all lengths. 

We will need the following asymptotic notation: let 
$a(n) \sim b(n)$ denote that $\lim_{n\to\infty} \frac{a(n)}{b(n)} = 1$ and $a(n) \lesssim b(n)$ denote 
that $\lim_{n\to\infty} \frac{a(n)}{b(n)} \leq 1$. We will use the following asymptotic 
equality frequently: for fixed $c$, $\binom{n}{c} \sim \frac{n^c}{c!}$. 

B. Deletion and insertion channels 

We will formalize the problem of correcting deletions and 
insertions by defining deletion and insertion channels. The 
a-deletion b-insertion channel takes a string of length $n$, finds 
a substring of length $n-a$, and outputs a superstring of that 
substring of length $n-a+b$. For strings $x$ and $y$, write $x < y$ 
if $x$ is a substring of $y$ and define the following sets. 

Definition 1. For $x \in [q]^n$, define 
$S_{s,0}(x) = \{z \in [q]^{n-s} : z < x\}$, the set of substrings of $x$ 
that can be produced by $s$ deletions. Define 
$S_{0,s}(x) = \{w \in [q]^{n+s} : w > x\}$, the set of superstrings of $x$ 
that can be produced by $s$ insertions. Define 
$S_{a,b}(x) = \bigcup_{z \in S_{a,0}(x)} S_{0,b}(z)$. 

If $x$ is the input to an n-symbol a-deletion b-insertion 
channel, $S_{a,b}(x)$ is the set of possible outputs. 
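
For small parameters, the three sets in Definition 1 can be enumerated directly. The following brute-force sketch represents a q-ary string as a tuple of integers in `range(q)`. The function names are illustrative assumptions, and the enumeration is exponential, so it is only meant for experimentation.

```python
from itertools import combinations

def deletion_set(x, s):
    """S_{s,0}(x): strings obtained from x by deleting exactly s symbols."""
    n = len(x)
    return {tuple(x[i] for i in keep) for keep in combinations(range(n), n - s)}

def insertion_set(x, s, q):
    """S_{0,s}(x): q-ary strings obtained from x by inserting exactly s symbols."""
    out = {tuple(x)}
    for _ in range(s):
        out = {w[:i] + (sym,) + w[i:]
               for w in out for i in range(len(w) + 1) for sym in range(q)}
    return out

def channel_outputs(x, a, b, q):
    """S_{a,b}(x): delete a symbols, then insert b symbols."""
    outputs = set()
    for z in deletion_set(x, a):
        outputs |= insertion_set(z, b, q)
    return outputs

x = (0, 1, 0, 1)
subs = deletion_set(x, 1)           # one substring per run of x
outs = channel_outputs(x, 1, 1, 2)  # outputs have the same length as x
print(sorted(subs), len(outs))
```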

When two inputs share common outputs they can potentially 
be confused by the receiver. We are interested in codes that 
allow the correction of s deletions. The following sets allow 
the definition of channels that perform both types of 
synchronization errors. 

Definition 2. A q-ary n-symbol a-deletion b-insertion correcting 
code is a set $C \subseteq [q]^n$ such that for any two distinct strings 
$x, y \in C$, $S_{a,b}(x) \cap S_{a,b}(y) = \emptyset$. 

Lemma 1. For $l, m, n \in \mathbb{N}$ with $l < m$ and $l < n$, let $x \in [q]^m$ 
and $y \in [q]^n$. Then there exists $z \in [q]^l$ such that $x > z$ and 
$y > z$ if and only if there exists $w \in [q]^{m+n-l}$ such that 
$w > x$ and $w > y$. 

Lemma 2. For $a, b, n \in \mathbb{N}$ and $x, y \in [q]^n$, 
$D_{a+b}(x) \cap D_{a+b}(y) = \emptyset$ if and only if 
$S_{a,b}(x) \cap S_{a,b}(y) = \emptyset$. Any $(a+b)$-deletion 
correcting code is also an a-deletion b-insertion correcting 
code. 

Proof: Suppose there is some $z \in D_{a+b}(x) \cap D_{a+b}(y)$. 
Then there are $u, v \in [q]^{n-a}$ such that $x > u > z$ and 
$y > v > z$. The length of $z$ is $n-a-b$, so by Lemma 1 there is 
some $w$ of length $(n-a) + (n-a) - (n-a-b) = n-a+b$ 
such that $w > u$ and $w > v$. Thus $w$ is in both $S_{a,b}(x)$ and 
$S_{a,b}(y)$. All of these implications are also true in reverse. ■ 
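
Lemma 2 can be sanity-checked by exhaustive search on short strings. The sketch below assumes that $D_s(x)$ denotes the set of strings obtained from $x$ by $s$ deletions, and it compares disjointness of the deletion sets with disjointness of the mixed-channel output sets for every pair of inputs. All function names are illustrative assumptions.

```python
from itertools import combinations, product

def deletion_set(x, s):
    """Strings obtained from x by deleting exactly s symbols."""
    n = len(x)
    return {tuple(x[i] for i in keep) for keep in combinations(range(n), n - s)}

def insertion_set(x, s, q):
    """q-ary strings obtained from x by inserting exactly s symbols."""
    out = {tuple(x)}
    for _ in range(s):
        out = {w[:i] + (sym,) + w[i:]
               for w in out for i in range(len(w) + 1) for sym in range(q)}
    return out

def channel_outputs(x, a, b, q):
    """Outputs of the a-deletion b-insertion channel on input x."""
    outputs = set()
    for z in deletion_set(x, a):
        outputs |= insertion_set(z, b, q)
    return outputs

def lemma2_holds(n, a, b, q):
    """Check the equivalence of Lemma 2 for every pair of length-n inputs."""
    strings = list(product(range(q), repeat=n))
    D = {x: deletion_set(x, a + b) for x in strings}
    S = {x: channel_outputs(x, a, b, q) for x in strings}
    return all(D[x].isdisjoint(D[y]) == S[x].isdisjoint(S[y])
               for x, y in combinations(strings, 2))

print(lemma2_holds(4, 1, 1, 2))
```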

Definition 3. Let $B_{q,l,a,b}$ be a bipartite graph with left vertex 
set $[q]^{l+a}$ and right vertex set $[q]^{l+b}$. Vertices are adjacent if 
they have a common substring of length $l$. 

If $x$ is a left vertex of $B_{q,l,a,b}$, then with $n = l+a$ and 
$m = l+b$ its neighborhood is $S_{n-l,m-l}(x)$. This graph completely 
describes the behavior of the n-symbol $(n-l)$-deletion 
$(m-l)$-insertion channel. 

Each $x \in [q]^{n-s}$ has the same number of superstrings of 
length $n$: 

$$|S_{0,s}(x)| = I_{q,s,n}, \qquad (1)$$

where 

$$I_{q,s,n} = \sum_{i=0}^{s} \binom{n}{i}(q-1)^i.$$
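
Equation (1) can be verified by enumeration for small parameters. The sketch below assumes the closed form $I_{q,s,n} = \sum_{i=0}^{s} \binom{n}{i}(q-1)^i$ and checks that every string of length $n-s$ has exactly that many superstrings of length $n$. The function names are illustrative assumptions.

```python
from itertools import product
from math import comb

def insertion_set(x, s, q):
    """All q-ary superstrings of x obtained by exactly s insertions."""
    out = {tuple(x)}
    for _ in range(s):
        out = {w[:i] + (sym,) + w[i:]
               for w in out for i in range(len(w) + 1) for sym in range(q)}
    return out

def superstring_count(q, s, n):
    """Closed form I_{q,s,n} for the number of length-n superstrings."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(s + 1))

q, s, n = 3, 2, 5
# Collect the distinct superstring counts over all inputs of length n - s.
sizes = {len(insertion_set(x, s, q)) for x in product(range(q), repeat=n - s)}
print(sizes, superstring_count(q, s, n))
```

A single value in `sizes` confirms that the count does not depend on the starting string.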



III. Constructing edges 

To execute the strategy described in Section I-A, we need 
a lower bound on the degree of a channel input. This is a 
lower bound on the degree of a left vertex of $B_{q,l,a,b}$. To 
obtain this bound, we will first construct a subset of the edges 
of $B_{q,l,a,b}$ that is easier to work with than the complete edge 
set. Our ultimate lower bound on the degree of an input will 
actually be a lower bound on the number of edges for this 
subset incident to the input vertex. 

One way to get information about the size of a target set 
$T$ is to find a construction function $f : P \to T$, where $P$ is 
an easily counted parameter set. If $f$ is injective, then 
$|P| = |f(P)|$ and $|P| \leq |T|$. We can demonstrate the injectivity of $f$ 
with a deconstruction function $g : T \to P$ that is a left inverse 
of $f$. This means that $g(f(p)) = p$ for all $p \in P$. If the function 
$g$ is given a constructable member of $T$, it deconstructs it 
and recovers the construction parameters. Similarly, if $f$ is 
surjective, then we can find an injective $g : T \to P$ that is a 
right inverse of $f$, so $|T| = |g(T)|$ and $|P| \geq |T|$. If $f$ is both 
injective and surjective, then $|P| = |T|$. 

In this section we apply these methods to the edge set of 
$B_{q,l,a,b}$. We give an upper bound on the number of edges 
and discuss why it is difficult to count the edges exactly. We 
explain our construction of a subset of the edges. Finally we 
show that the upper and lower bounds match asymptotically. 

A. An upper bound 

By definition, two vertices in $B_{q,l,a,b}$ are adjacent if they 
share a substring of length $l$. This makes the common substring 
a natural construction parameter for the edge. We can construct 
an edge by starting with a string of length $l$, performing $a$ 
arbitrary insertions to get the left vertex, and performing $b$ 
arbitrary insertions to get the right vertex. 

Lemma 3. For all $q, l, a, b \in \mathbb{N}$ with $s = a+b$, the number 
of edges in $B_{q,l,a,b}$ satisfies 

$$|E(B_{q,l,a,b})| \leq q^l I_{q,a,l+a} I_{q,b,l+b}.$$

Proof: Let $n = l+a$ and $m = l+b$. There are $q^l I_{q,a,l+a} I_{q,b,l+b}$ 
triples $(z,x,y) \in [q]^l \times [q]^n \times [q]^m$ such that $z < x$ and $z < y$. 
If $x \in [q]^n$ and $y \in [q]^m$ are adjacent, then they have at least one 
common substring of length $l$ and appear in at least one triple. ■ 
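
The triple-counting bound of Lemma 3 can be compared with an exact edge count on a small graph. This sketch assumes the bound has the form $q^l I_{q,a,l+a} I_{q,b,l+b}$ with $I_{q,s,n} = \sum_{i=0}^{s}\binom{n}{i}(q-1)^i$, and it counts edges by testing every left-right pair for a common substring of length $l$. The function names are illustrative assumptions.

```python
from itertools import combinations, product
from math import comb

def deletion_set(x, s):
    """Strings obtained from x by deleting exactly s symbols."""
    n = len(x)
    return {tuple(x[i] for i in keep) for keep in combinations(range(n), n - s)}

def superstring_count(q, s, n):
    """I_{q,s,n}: number of length-n superstrings of a string of length n - s."""
    return sum(comb(n, i) * (q - 1) ** i for i in range(s + 1))

def edge_count(q, l, a, b):
    """Count pairs in [q]^(l+a) x [q]^(l+b) with a common substring of length l."""
    right = list(product(range(q), repeat=l + b))
    right_subs = {y: deletion_set(y, b) for y in right}
    count = 0
    for x in product(range(q), repeat=l + a):
        xs = deletion_set(x, a)
        count += sum(1 for y in right if not xs.isdisjoint(right_subs[y]))
    return count

def lemma3_bound(q, l, a, b):
    """Number of (z, x, y) triples, an upper bound on the number of edges."""
    return q ** l * superstring_count(q, a, l + a) * superstring_count(q, b, l + b)

print(edge_count(2, 3, 1, 1), lemma3_bound(2, 3, 1, 1))
```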
This upper bound is not an equality because many pairs of 
strings $(x,y) \in [q]^n \times [q]^m$ have multiple common substrings 
$z \in [q]^l$. Pairs of strings with multiple common substrings of 
length I fall into two classes. Pairs in the first class have a 
common substring of length more than I. Call this string w. 
In this case, every substring of length I of w is a common 
substring of the pair. Pairs in the second class have multiple 
maximum length common substrings. For example, the strings 
0101 and 1010 have both 010 and 101 as substrings. 
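
The example can be reproduced by intersecting the single-deletion sets of the two strings. The helper name `deletion_set` is an illustrative assumption.

```python
from itertools import combinations

def deletion_set(x, s):
    """Strings obtained from x by deleting exactly s symbols."""
    n = len(x)
    return {tuple(x[i] for i in keep) for keep in combinations(range(n), n - s)}

x = (0, 1, 0, 1)
y = (1, 0, 1, 0)
# Common substrings of length 3 of the two length-4 strings.
common = deletion_set(x, 1) & deletion_set(y, 1)
print(sorted(common))
```

The pair therefore appears in two of the triples counted in the proof of Lemma 3, although it is a single edge.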



To determine the exact number of edges in $B_{q,l,a,b}$, it is 
necessary to determine the sizes of both classes. The size of 
the first class can be found easily if the number of edges in 
$B_{q,l+i,a-i,b-i}$ is known for all $i$ up to $\min(a, b)$. It is more 
difficult to characterize the vertex pairs of the second class. 
Consequently, our lower bound will also not be tight. 

B. A lower bound 

We have constructed every edge at least once by starting 
with every possible common substring and performing all pos- 
sible insertions. By using a restricted set of starting substrings 
and allowed insertions, we will construct each edge at most 
once. Specifically, we will require that the interval between 
two insertion points is not alternating. 

Definition 4. A string is alternating if some $u \in [q]$ appears 
at all even indices, some $v \in [q]$ appears at all odd indices, 
and $u \neq v$. Let $A_{q,n}$ be the set of alternating q-ary strings of 
length $n$. 

The empty string and all strings of length one are trivially 
alternating. For each length $n \geq 2$, each of the $q$ choices 
for $u$ and $q-1$ choices for $v$ results in a unique string, so 
$|A_{q,n}| = q(q-1)$. The shortest nonalternating strings have 
length two, so our restriction prevents two insertions from 
occurring too close to each other. 
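
The count of alternating strings is easy to confirm by enumeration. The sketch below is one possible reading of Definition 4 in code, treating strings of length zero and one as alternating. The function names are illustrative assumptions.

```python
from itertools import product

def is_alternating(x):
    """True if one symbol fills the even indices and a different one the odd indices."""
    evens, odds = set(x[0::2]), set(x[1::2])
    if len(evens) > 1 or len(odds) > 1:
        return False
    # Strings of length 0 or 1 have no odd-index symbol and are trivially alternating.
    return not odds or evens != odds

def count_alternating(q, n):
    """Number of alternating q-ary strings of length n, by brute force."""
    return sum(is_alternating(x) for x in product(range(q), repeat=n))

q = 3
print([count_alternating(q, n) for n in range(6)])
```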

To formalize the set of allowable starting substrings, we will 
need the following definition. 

Definition 5. Let the family of compositions with $t$ dimensions, 
total multiplicity $l$, and minimum multiplicity $k$ be 

$$M(t,l,k) = \left\{ c \in (\mathbb{N} \setminus [k])^t : \sum_{i \in [t]} c_i = l \right\}.$$



Now we can define the parameter set for the construction 
function in the lower bound. 

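Definition 6. For all $q, l, a, b \in \mathbb{N}$, let $s = a+b$ and let 

$$P_{q,l,a,b} = \binom{[s]}{a} \times ([q] \setminus \{0\})^s \times \bigcup_{c \in M(s+1,l,0)} \prod_{i \in [s+1]} \left([q]^{c_i} \setminus A_{q,c_i}\right).$$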

Now we give a summary of the role that the different components 
of $P_{q,l,a,b}$ play in construction. The starting substring 
is specified as $s+1$ intervals. The length of the $i$th interval is 
$c_i$. The total length of the starting substring is $l$, so the vector 
of interval lengths is an element of $M(s+1,l,0)$. An interval 
cannot be alternating, so it is an element of $[q]^{c_i} \setminus A_{q,c_i}$. Each 
gap between intervals is filled with an inserted symbol in one 
of the endpoints of the edge and nothing in the other endpoint. 
The subset of the gaps that contains the insertions in the left 
endpoint is an element of $\binom{[s]}{a}$. The inserted symbols will always differ 
from the first symbol of the next interval, so there are (q — 1) 
possibilities for the inserted symbol. 

For convenience, we will define our construction and deconstruction 
functions over larger sets. Start with $P_{q,l,a,b}$ and drop 
the interval nonalternation requirement. Then fix $s = a+b$ and 
take the union over $a \in [s]$ to get 

$$2^{[s]} \times ([q] \setminus \{0\})^s \times ([q]^*)^{s+1}.$$

If we represent the subset $c \in 2^{[s]}$ as a string $c' \in [2]^s$, all three 
terms of the product are lists. To avoid confusion between 
these bits and the q-ary symbols that appear everywhere else, 
we will use the two element set $LR = \{\text{Left}, \text{Right}\}$ rather 
than $[2] = \{0, 1\}$. To get the final parameter set, we regroup: 
$[q]^* \times (LR \times ([q] \setminus \{0\}) \times [q]^*)^s$. 

Our construction function is Algorithm 1 and our deconstruction 
function is Algorithm 2. These will treat strings as 
lists of symbols. We represent the empty list as $\epsilon$. The function 
Head returns the first symbol of a nonempty list and the 
function Tail returns everything except the head. The function 
Length returns the number of symbols in the string. 

The Construct function iteratively builds up a pair of 
strings. Construct calls Insert once per iteration. The 
Insert function takes one of the triples described above as an 
argument and outputs two strings. The two strings are equal to 
the third component of the triple except that a single symbol 
has been inserted at the head of one of the outputs. 

The Deconstruct function repeatedly calls Delete. Each 
Delete undoes the effect of an Insert. Delete takes a 
pair of strings $x$ and $y$ that differ in their first symbol. 
Delete must pick one of these symbols to be the first 
symbol of a common substring of the input strings. To decide 
which first symbol to keep, Delete calls Match twice. 
The Match function takes two strings $x$ and $y$, finds their 
longest common prefix, and outputs the prefix and the two 
corresponding suffixes. Delete calls Match on $(\text{Tail}(x), y)$ 
and $(x, \text{Tail}(y))$ and then performs the deletion that resulted 
in a longer common prefix. The information about the deletion 
and prefix become a triple. Delete returns this triple along 
with the two suffixes from the match. 
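
The description above can be turned into a short executable sketch. It represents q-ary strings as tuples of integers, does symbol arithmetic modulo q, and passes q as an explicit argument. The lowercase function names and the string labels "Left" and "Right" are illustrative assumptions, not the paper's notation.

```python
def match(x, y):
    """Return (longest common prefix, remainder of x, remainder of y)."""
    k = 0
    while k < len(x) and k < len(y) and x[k] == y[k]:
        k += 1
    return x[:k], x[k:], y[k:]

def insert(side, delta, w, q):
    """Prepend (delta + Head(w)) mod q to the copy of w on the chosen side."""
    w_prime = ((delta + w[0]) % q,) + w
    return (w_prime, w) if side == "Left" else (w, w_prime)

def construct(z0, triples, q):
    """Build the two endpoints of an edge from a prefix and a list of triples."""
    x = y = tuple(z0)
    for side, delta, w in triples:
        u, v = insert(side, delta, tuple(w), q)
        x, y = x + u, y + v
    return x, y

def delete(x, y, q):
    """Undo one insert: recover its triple and the two remaining suffixes."""
    g = (x[0] - y[0]) % q
    a, b, c = match(x[1:], y)   # assume the extra symbol is at the head of x
    d, e, f = match(x, y[1:])   # assume the extra symbol is at the head of y
    assert len(a) != len(d)
    if len(a) > len(d):
        return ("Left", g, a), b, c
    return ("Right", (-g) % q, d), e, f

def deconstruct(x, y, q):
    """Recover the prefix and the list of triples from the two endpoints."""
    z0, x, y = match(tuple(x), tuple(y))
    triples = []
    while x and y:
        t, x, y = delete(x, y, q)
        triples.append(t)
    assert not x and not y
    return z0, triples

q = 3
z0 = (0, 0)
triples = [("Left", 1, (1, 1, 2)), ("Right", 2, (0, 0))]
x, y = construct(z0, triples, q)
print(x, y, deconstruct(x, y, q) == (z0, triples))
```

In this example every interval is nonalternating, so the round trip recovers the original parameters.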

C. Deconstructability 

Now we will show that Deconstruct is a left inverse of 
Construct. The first step is to look at the inner functions: 
Insert and Delete. 

Lemma 4. For $lr \in LR$, $\delta \in [q] \setminus \{0\}$, and $w \in [q]^m \setminus A_{q,m}$, 
let $(x,y) = \text{Insert}(lr, \delta, w)$. Let $u$ and $v$ be arbitrary q-ary 
strings with different first symbols. Then 
$\text{Delete}(x : u, y : v) = ((lr,\delta,w),u,v)$. 

Proof: Let $w = (w_0, w_1, \ldots, w_{m-1})$. Without loss of 
generality let $lr = \text{Left}$, so $x = (w_0 + \delta) : w$ and $y = w$. First, 
Delete computes $g = (w_0 + \delta) - w_0 = \delta$. Next, it evaluates 
$\text{Match}(w : u, w : v)$ and obtains $(w,u,v)$ because $u_0 \neq v_0$. 
Thus the length of the first match is $\text{Length}(w) = m$. Second, 
it evaluates $\text{Match}((w_0 + \delta) : w : u, \text{Tail}(w) : v)$. If the length of 
the second match is at least $m-1$, then $w_0 + \delta = w_1$ and 
$w_i = w_{i+2}$ for $0 \leq i \leq m-3$. This would make $w$ alternating, 
so the length of the second match is at most $m-2$. The first 
match is longer than the second, so the first branch of the if 
statement is taken and the function returns $((\text{Left}, \delta, w), u, v)$. ■ 

Algorithm 1 Construct an edge 

Construct : $[q]^* \times (LR \times ([q] \setminus \{0\}) \times [q]^*)^s \to [q]^* \times [q]^*$ 
Construct(z_0, t) 
    x ← z_0 
    y ← z_0 
    while t ≠ ε do 
        (u, v) ← Insert(Head(t)) 
        t ← Tail(t) 
        x ← x : u 
        y ← y : v 
    end while 
    return (x, y) 

Insert : $LR \times ([q] \setminus \{0\}) \times [q]^* \to [q]^* \times [q]^*$ 
Insert(lr, δ, w) 
    w' ← (δ + Head(w)) : w 
    if lr = Left then 
        return (w', w) 
    else 
        return (w, w') 
    end if 



Lemma 5. For $0 \leq i \leq s$ let $w_i$ be a q-ary string that is 
not alternating. For $1 \leq i \leq s$ let $lr_i \in \{\text{Left}, \text{Right}\}$, let 
$\delta_i \in [q] \setminus \{0\}$, and let $t_i = (lr_i, \delta_i, w_i)$. Let $t = (t_1, \ldots, t_s)$. 
Then $\text{Deconstruct}(\text{Construct}(w_0, t)) = (w_0, t)$. 

Proof: The strings output by Construct are the concatenation 
of $s+1$ intervals. The first pair of intervals are 
equal to $w_0$. The remaining $s$ pairs are each produced by one 
call to Insert, so their initial symbols differ. Consequently, 
the initial call to Match in Deconstruct finds $w_0$. After 
that, the conditions of Lemma 4 are met. Each call to Delete 
recovers the input to one Insert operation and preserves the 
conditions of Lemma 4. ■ 

Lemma 6. For all $q, l, a, b \in \mathbb{N}$, there are functions 
$\text{Construct} : P_{q,l,a,b} \to E(B_{q,l,a,b})$ and 
$\text{Deconstruct} : E(B_{q,l,a,b}) \to P_{q,l,a,b}$ such that 
$\text{Deconstruct}(\text{Construct}(p)) = p$ for all $p \in P_{q,l,a,b}$. 

Proof: From Lemma 5, Deconstruct is a left inverse 
of Construct. The discussion at the beginning of this 
section describes the bijection between $P_{q,l,a,b}$ and the input to 
Construct. It is easy to check that the strings produced by 
Construct have lengths $l+a$ and $l+b$ and have a common 
substring of length $l$. ■ 

Lemma 7. For fixed $q, a, b \in \mathbb{N}$ and $s = a+b$, 
$|P_{q,l,a,b}| \gtrsim q^l \binom{l}{s} \binom{s}{a} (q-1)^s$. 

Proof: First, $\left|\binom{[s]}{a}\right| = \binom{s}{a}$ and $|([q] \setminus \{0\})^s| = (q-1)^s$. 
For $c_i \geq 2$, $|[q]^{c_i} \setminus A_{q,c_i}| = q^{c_i} - q(q-1)$. The number of 
sequences of strings is 

$$\sum_{c \in M(s+1,l,2)} \prod_{i=0}^{s} \left(q^{c_i} - q(q-1)\right)
\geq q^l \sum_{c \in M(s+1,l,2)} \prod_{i=0}^{s} \left(1 - q^{2-c_i}\right)
\geq q^l \sum_{c \in M(s+1,l,2+\log_q l)} \prod_{i=0}^{s} \left(1 - q^{-\log_q l}\right).$$

Because $|M(t,l,k)| = \binom{l + (1-k)t - 1}{t-1}$, this equals 

$$q^l \binom{l - (1+\log_q l)(s+1) - 1}{s} \left(1 - l^{-1}\right)^{s+1} \sim q^l \binom{l}{s}.$$ ■ 

Algorithm 2 Deconstruct an edge 

Deconstruct : $[q]^* \times [q]^* \to [q]^* \times (LR \times ([q] \setminus \{0\}) \times [q]^*)^s$ 
Deconstruct(x, y) 
    (z_0, x, y) ← Match(x, y) 
    t ← ε 
    while x ≠ ε ∧ y ≠ ε do 
        (w, x, y) ← Delete(x, y) 
        t ← t : w 
    end while 
    assert x = ε ∧ y = ε 
    return (z_0, t) 

Delete : $[q]^* \times [q]^* \to (LR \times ([q] \setminus \{0\}) \times [q]^*) \times [q]^* \times [q]^*$ 
Delete(x, y) 
    g ← Head(x) - Head(y) 
    (a, b, c) ← Match(Tail(x), y) 
    (d, e, f) ← Match(x, Tail(y)) 
    assert Length(a) ≠ Length(d) 
    if Length(a) > Length(d) then 
        return ((Left, g, a), b, c) 
    else 
        return ((Right, -g, d), e, f) 
    end if 

Match : $[q]^i \times [q]^j \to [q]^k \times [q]^{i-k} \times [q]^{j-k}$ 
Match(x, y) 
    w ← ε 
    while x ≠ ε ∧ y ≠ ε ∧ Head(x) = Head(y) do 
        w ← w : Head(x) 
        x ← Tail(x) 
        y ← Tail(y) 
    end while 
    return (w, x, y) 

Our bounds establish the asymptotic growth of the number 
of edges. 

Theorem 1. The number of edges in $B_{q,l,a,b}$ satisfies 
$|E(B_{q,l,a,b})| \sim q^l \binom{l}{a} \binom{l}{b} (q-1)^s$. The average of $|S_{a,b}(x)|$ over 
all $x \in [q]^n$ is asymptotic to $\binom{n}{a} \binom{n}{b} (q-1)^s q^{-a}$. 

Proof: The first claim follows immediately from Lemmas 
3, 6, and 7. For $x \in [q]^n$, the set $S_{a,b}(x)$ is the neighborhood 
of $x$ in $B_{q,n-a,a,b}$. Each edge involves exactly one of the $q^n$ 
left vertices, $\binom{n-a}{a} \sim \binom{n}{a}$, and $\binom{n-a}{b} \sim \binom{n}{b}$. ■ 

Now we can conclude that most edges are constructable by 
our method. This is a necessary condition for the asymptotic 
tightness of our ultimate lower bound on input degree. 

IV. Bounds on Input Degree and Code Size 

Lemma 8. Let $x \in [q]^n$ be a string with $r$ runs. Let $c$ be the 
length of the longest alternating subinterval of $x$. Then $|S_{a,b}(x)|$, 
the number of unique strings that can be produced from $x$ by 
$a$ deletions and $b$ insertions, is at least 

$$\binom{r - a - 2 - (a+1)c}{a} \binom{n - 2a - 1 - (2a+b+1)c}{b} (q-1)^b.$$

Proof: Given a string $x \in [q]^n$, we identify a subset of 
$P_{q,n-a,a,b}$. Application of Construct from Lemma 6 to any 
member of this subset produces an edge with left endpoint $x$. 
By Lemma 6, each edge is produced at most once. 

We select $a$ symbols of $x$ for deletion, select $b$ spaces 
between symbols for insertion, and specify the $b$ new symbols. 
The selected symbols and spaces partition $x$ into $s+1$ intervals. 
To ensure that none of these intervals are alternating, we will 
require that all of the intervals contain at least $c+1$ symbols. 

Deleting any of the symbols in a run has the same effect, 
so we will select symbols that occur at the left ends of runs. 
We cannot select the last symbol of the string, so there are 
$r-1$ symbols to choose from. To ensure that there are at 
least $c+1$ symbols between consecutive deleted symbols, we 
will require that there be $c+1$ deletable symbols between them. There are 
$\binom{r-1-(a+1)(c+1)}{a}$ ways to pick the symbols for deletion. 

There are $n-1$ potential spaces to make an insertion. 
Insertions cannot be done in the $c+1$ spaces before and after 
a deleted symbol. In the worst case, all of these forbidden 
spaces are distinct, leaving $n-1-2a(c+1)$ spaces to choose 
from. There must be $c+1$ symbols and $c$ spaces between any 
two consecutive chosen spaces. Thus there are always at least 
$\binom{n-1-2a(c+1)-(b+1)c}{b}$ ways to pick the spaces. 

Finally, for each of the $b$ insertion points, we must specify 
the inserted symbol. To do this, for each insertion point we 
pick $\delta \in [q] \setminus \{0\}$ and make the new symbol equal to $\delta$ plus 
its successor. Thus, there are $(q-1)^b$ choices for this step. 
Slight rearrangement gives the claimed result. ■ 
To apply Lemma 8 to a string, we need two statistics of 
that string: the number of runs and the length of the longest 
alternating subinterval. The next two lemmas concern the 
distributions of these statistics. 

Lemma 9. The number of q-ary strings of length $n$ with an 
alternating subinterval of length at least $c$ is at most 
$(n-c+1) q^{n-c+1} (q-1)$. 

Proof: A string of length $n$ contains $n-c+1$ intervals of 
length $c$. If some subinterval of length at least $c$ is alternating, 
at least one of the subintervals of length exactly $c$ is alternating. 
There are $q(q-1)$ choices for the alternating interval and $q^{n-c}$ 
choices for the remaining symbols. ■ 
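
The union bound in Lemma 9 can be compared with an exact count on short strings. The sketch below treats a subinterval as a contiguous window and assumes $2 \leq c \leq n$. The function names are illustrative assumptions.

```python
from itertools import product

def is_alternating(x):
    """True if one symbol fills the even indices and a different one the odd indices."""
    evens, odds = set(x[0::2]), set(x[1::2])
    if len(evens) > 1 or len(odds) > 1:
        return False
    return not odds or evens != odds

def has_alternating_interval(x, c):
    """True if some contiguous window of length exactly c is alternating."""
    return any(is_alternating(x[i:i + c]) for i in range(len(x) - c + 1))

def count_with_alternating_interval(q, n, c):
    """Exact number of q-ary strings of length n with such a window."""
    return sum(has_alternating_interval(x, c)
               for x in product(range(q), repeat=n))

def lemma9_bound(q, n, c):
    """Upper bound (n - c + 1) q^(n - c + 1) (q - 1) from Lemma 9."""
    return (n - c + 1) * q ** (n - c + 1) * (q - 1)

print(count_with_alternating_interval(2, 4, 3), lemma9_bound(2, 4, 3))
```

Checking windows of length exactly c suffices because any longer alternating subinterval contains one.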

Lemma 10. The number of q-ary strings of length $n$ with 
$\left(\frac{q-1}{q} - \epsilon\right)(n-1) + 1$ or fewer runs is at most $q^n e^{-2(n-1)\epsilon^2}$. 

Proof: For $x \in [q]^n$, let $x' \in [q]^{n-1}$ be the string of first 
differences of $x$. That is, let $x'_i = x_{i+1} - x_i \bmod q$. If $x$ has $r$ 
runs, then $x'$ is nonzero at the $r-1$ boundaries between runs. 
Thus there are $q \binom{n-1}{r-1} (q-1)^{r-1}$ strings with exactly $r$ runs. 
Now we can apply the following Chernoff inequality: 

$$\sum_{i=0}^{\left(\frac{q-1}{q} - \epsilon\right)(n-1)} \binom{n-1}{i} (q-1)^i \leq q^{n-1} e^{-2(n-1)\epsilon^2}.$$ ■ 
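
Both ingredients of this proof can be checked numerically. The sketch below assumes the run-count formula $q\binom{n-1}{r-1}(q-1)^{r-1}$ and the tail bound $q^{n-1}e^{-2(n-1)\epsilon^2}$, verifies the former by enumeration, and evaluates the latter for one choice of parameters. The function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from math import comb, exp, floor

def num_runs(x):
    """Number of maximal blocks of equal symbols in a nonempty string."""
    return 1 + sum(1 for u, v in zip(x, x[1:]) if u != v)

def strings_with_runs(q, n, r):
    """Closed form for the number of q-ary strings of length n with r runs."""
    return q * comb(n - 1, r - 1) * (q - 1) ** (r - 1)

def few_runs_tail(q, n, eps):
    """Left-hand side of the inequality: the partial sum up to the threshold."""
    top = floor((Fraction(q - 1, q) - eps) * (n - 1))
    return sum(comb(n - 1, i) * (q - 1) ** i for i in range(top + 1))

def chernoff_bound(q, n, eps):
    """Right-hand side of the inequality."""
    return q ** (n - 1) * exp(-2 * (n - 1) * float(eps) ** 2)

q, n = 3, 5
brute = {r: 0 for r in range(1, n + 1)}
for x in product(range(q), repeat=n):
    brute[num_runs(x)] += 1
print(brute)
print(few_runs_tail(2, 21, Fraction(1, 5)), chernoff_bound(2, 21, Fraction(1, 5)))
```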



Now we have all of the ingredients required to execute the 
strategy described in Section I-A. 

Theorem 2. For fixed $q, a, b \in \mathbb{N}$ and $s = a+b$, the number 
of codewords in an n-symbol q-ary a-deletion b-insertion 
correcting code is asymptotically at most 

$$\frac{q^{n+b}}{(q-1)^s \binom{n}{a} \binom{n}{b}}.$$

Proof: We divide strings into three classes: typical strings, 
strings with a long alternating subinterval, and strings with few 
runs. Call these classes $C_1$, $C_2$, and $C_3$ respectively. 

A string will fall into $C_2$ if it has an alternating subinterval 
of length at least $c$. If we let $c = (s+2)\log_q n$, then by 
Lemma 9 we have $|C_2| \leq n q^{n-c+1}(q-1) = n^{-(s+1)} q^{n+1}(q-1)$, 
which is $O(q^n / n^{s+1})$. 

The average number of runs is $\frac{q-1}{q}(n-1) + 1$. A string will 
fall in the third class if it has at most $\left(\frac{q-1}{q} - \epsilon\right)(n-1) + 1$ 
runs. If we let $\epsilon = \sqrt{\frac{(s+1)\ln n}{2(n-1)}}$, then by Lemma 10 we have 

$$|C_3| \leq q^n e^{-2(n-1)\epsilon^2} = q^n e^{-(s+1)\ln n} = q^n / n^{s+1}.$$

For fixed $s$, this $\epsilon$ is $o(1)$, so $\left(\frac{q-1}{q} - \epsilon\right)(n-1) + 1 \sim \frac{q-1}{q} n$. 

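Now we can apply Lemma 8 to lower bound the degree 
of the typical strings. As before, let $s = a+b$. The first 
multiplicative term in the lower bound is asymptotic to 

$$\binom{\frac{q-1}{q} n - (a+1)(s+2)\log_q n}{a} \sim \binom{\frac{q-1}{q} n}{a} \sim \left(\frac{q-1}{q}\right)^a \binom{n}{a}.$$

The second term is asymptotic to 

$$\binom{n - (2a+b+1)(s+2)\log_q n}{b} \sim \binom{n}{b}.$$

Thus 

$$\min_{x \in C_1} |S_{a,b}(x)| \gtrsim \left(\frac{q-1}{q}\right)^a \binom{n}{a} \binom{n}{b} (q-1)^b = \frac{(q-1)^s}{q^a} \binom{n}{a} \binom{n}{b}.$$

There are $q^{n-a+b}$ possible outputs, so the number of 
codewords is at most 

$$\frac{q^{n-a+b}}{\min_{x \in C_1} |S_{a,b}(x)|} + |C_2| + |C_3|
\lesssim \frac{q^{n-a+b} q^a}{(q-1)^s \binom{n}{a} \binom{n}{b}} + \frac{q^{n+1}(q-1)}{n^{s+1}} + \frac{q^n}{n^{s+1}}
\sim \frac{q^{n+b}}{(q-1)^s \binom{n}{a} \binom{n}{b}}.$$ ■ 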

By setting b to zero we recover Levenshtein's upper bound. 
The generalized bound offers an improvement whenever 1 is 
a better choice for b than 0. This occurs whenever s > q. 
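
The comparison between $b = 1$ and $b = 0$ can be made concrete. The sketch below assumes the bound of Theorem 2 has the form $q^{n+b} / ((q-1)^s \binom{n}{a}\binom{n}{b})$ with $a = s - b$, and computes the exact ratio of the two choices, which equals $q(n-s+1)/(sn)$ and tends to $q/s$ as n grows. The function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def theorem2_bound(n: int, q: int, a: int, b: int) -> Fraction:
    """Assumed form of the asymptotic bound for the a-deletion b-insertion channel."""
    s = a + b
    return Fraction(q ** (n + b), (q - 1) ** s * comb(n, a) * comb(n, b))

def ratio_b1_to_b0(n: int, q: int, s: int) -> Fraction:
    """Bound with one insertion divided by the all-deletion (Levenshtein) bound."""
    return theorem2_bound(n, q, s - 1, 1) / theorem2_bound(n, q, s, 0)

n = 10 ** 4
print(float(ratio_b1_to_b0(n, 2, 3)))  # below 1: b = 1 gives a smaller bound
print(float(ratio_b1_to_b0(n, 3, 2)))  # above 1: b = 0 gives a smaller bound
```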

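The best bound is achieved when $q^b / \binom{s}{b}$ is minimized. This 
occurs when $b = \left\lceil \frac{s-q}{q+1} \right\rceil$. If we pick this value for $b$, there is 
some $c \in [q+1]$ such that $b = \frac{s-q+c}{q+1}$ and $s = b(q+1) + q - c$. 
Then the improvement over Levenshtein's bound is 

$$\frac{q^b}{\binom{s}{b}} = b!\, q^b \prod_{i=0}^{b-1} \frac{1}{b(q+1) + q - c - i} \leq \frac{b!}{b^b} \leq \frac{e\sqrt{b}}{e^b}.$$

The first inequality is true because $c \leq q$ and $i < b$. The 
second inequality comes from Stirling's approximation. 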
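
The optimal number of insertions can be found by direct search and compared with the ceiling formula. The sketch below assumes that the bound depends on $b$ only through the factor $q^b / \binom{s}{b}$ and that the optimum is $\lceil (s-q)/(q+1) \rceil$. The function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def improvement_factor(q: int, s: int, b: int) -> Fraction:
    """Ratio of the generalized bound with b insertions to the all-deletion bound."""
    return Fraction(q ** b, comb(s, b))

def best_b(q: int, s: int) -> int:
    """Smallest b in 0..s that minimizes the improvement factor."""
    return min(range(s + 1), key=lambda b: improvement_factor(q, s, b))

def predicted_b(q: int, s: int) -> int:
    """ceil((s - q) / (q + 1)), computed with integer arithmetic."""
    return -((q - s) // (q + 1))

for q, s in [(2, 3), (2, 5), (2, 20), (4, 30)]:
    b = best_b(q, s)
    print(q, s, b, predicted_b(q, s), float(improvement_factor(q, s, b)))
```

When two consecutive values of $b$ tie, the search returns the smaller one, which is also the value given by the ceiling.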